Concurrency and Text File Processing

Mark Leighton Fisher on 2007-05-18T17:05:36

Concurrency will continue to become more important, since it is getting cheaper to add more CPUs to a chip than to speed up the chip. (After all, we don't have one big, fast neuron in our brains.) Much of programming is driven by some form of text file, as even when the data is binary, the source files for the program are still text. So when can we take advantage of concurrency during our text file processing?

The question is one of structure – how much and what kind of structure exists in the text file. In the text of programming languages, structure ranges from the free-for-all of C, Perl, Ruby, and that ilk to the line-structured Python and Fortran (any time you force a certain indentation, you have implicitly forced a line-oriented structure; some of us also remember RPG II).

Ordering is the other question. Parsing the characters into lines is inherently a sequential operation. (Parsing characters into any structured form is inherently a sequential operation.) Once you have the primary structured form, only then can you process the text in a parallel fashion. Logfiles are one example of a text file format amenable to parallel processing once they have been reduced into lines ("see how many U.S. government users your Apache server saw in the past month" is a parallelizable operation on your Apache logfile). As a partial counter-example, "how many of your routines implicitly returned an Int greater than 30" will likely require knowledge of your program's structure at more than just the line level (except maybe if you are programming in APL).
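The logfile case can be sketched roughly as follows. This is only an illustration under several assumptions, not a real log analyzer: the function names, the sample data, and the chunking scheme are all made up for the example, the first field of each line is assumed to be a resolved hostname (as Apache logs it when hostname lookups are enabled), and "U.S. government user" is crudely approximated as "host ends in .gov". The split into lines happens sequentially; only after that are the chunks of lines counted concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def is_gov_hit(line):
    """True if the first field (assumed to be the client host) ends in .gov."""
    fields = line.split(None, 1)
    return bool(fields) and fields[0].lower().endswith(".gov")


def count_chunk(lines):
    """Count matching lines in one chunk; chunks are independent of each other."""
    return sum(1 for line in lines if is_gov_hit(line))


def split_into_chunks(lines, n_chunks):
    """Cut the list of lines into at most n_chunks roughly equal pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_gov_hits(text, workers=4, executor_cls=ThreadPoolExecutor):
    # Sequential step: reduce the character stream to lines.
    lines = text.splitlines()
    # Parallelizable step: each chunk of lines is counted on its own.
    chunks = split_into_chunks(lines, workers)
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(count_chunk, chunks))


# Hypothetical sample data, invented for this sketch.
SAMPLE_LOG = """\
nasa.gov - - [01/May/2007:10:00:00 -0400] "GET / HTTP/1.1" 200 512
example.com - - [01/May/2007:10:00:01 -0400] "GET /a HTTP/1.1" 200 128
irs.gov - - [01/May/2007:10:00:02 -0400] "GET /b HTTP/1.1" 404 0
10.0.0.7 - - [01/May/2007:10:00:03 -0400] "GET /c HTTP/1.1" 200 64
"""

if __name__ == "__main__":
    print(count_gov_hits(SAMPLE_LOG))
```

A thread pool keeps the sketch simple, but in CPython it will not speed up CPU-bound counting; passing `executor_cls=ProcessPoolExecutor` (also from `concurrent.futures`) is the usual way to spread such work across cores, since the chunks share no state.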

Parallel processing by definition requires 2+ things to process (a thing can't be parallel to itself). If what you are processing is one big interconnected thing, then even though it may be divisible into smaller sub-things, you can't process it in parallel. Google Language Tools (IIRC) uses statistical text processing to derive the translated phrases (statistical analysis of that sort can be parallelized). A hypothetical True And Correct Natural Language Translator(tm) would require some understanding of the whole text to create translations in all cases, as material later in the text can require understanding of material earlier in the text to translate it correctly. (Fortunately, that usually isn't the case with the web pages I've had Google Language Tools translate for me.)

I'm wondering if the Unix/Linux model of separate coordinating processes (an MIMD model) would be more scalable over the long term than the vector-processing/SIMD model I keep hearing about from today's concurrency proponents. It may be no accident that some of the biggest concurrency successes in current software have been printing and webpage loading, as those are sequential processes that can be executed separately from the main locus of control.


Skip a Step

chromatic on 2007-05-19T01:20:41

If you avoid glomming all of those lines together in a single text file, you can avoid having to scan that file sequentially before you can parallelize. This can be handy if you need to process logfiles in parallel.

Re:Skip a Step

Aristotle on 2007-05-19T21:11:31

Dan Bernstein was onto something in more ways than one, obviously.

Re:Skip a Step

chromatic on 2007-05-20T01:53:18

I had in mind mod_log_sql or whatever it is, but special syslog magic could also apply.

Re:Skip a Step

Aristotle on 2007-05-20T02:50:15

Oh, I wasn’t just talking about logfiles. Same’s true for any structured data.

Re:Skip a Step

Mark Leighton Fisher on 2007-06-22T17:01:19

Too true. But sometimes we still have text files to deal with – I didn't mean to exclude sidestepping the ordered-bit problem by storing the data in random-access fashion.

what are you implying...

rjbs on 2007-05-21T02:52:28

"Concurrency will continue to become more important, since it is getting cheaper to add more CPUs to a chip than to speed up the chip. (After all, we don't have one big, fast neuron in our brains.)"
Are you trying to say that our brains work the way they do because God is a cheapskate? Given the erratic behavior I observe in people, I'd think of Him more as an overclocker.